과목: 실험통계학 - 중간고사 대체과제
담당교수: 구준모 교수님
학과: 기계공학과
학번: 2016100724
제출일: 2021.05.05. (WED)
이름: 박수빈
1. 공기청정기 가동여부에 따라 I/O Ratio의 평균으로 고려하는 경우
(a) 데이터 확인
- 필요한 라이브러리 호출
import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats
import plotly.express as px
import plotly.io as pio
from plotly import graph_objs as go
import plotly.figure_factory as ff
import chart_studio.plotly as py
import random
import statsmodels.formula.api as smf
from scipy.optimize import minimize- 미세먼지 데이터 불러온 후, 데이터 확인
# 미세먼지 데이터 가져오기
PM_measured = pd.read_csv("office_data.csv")
PM_measured = PM_measured.dropna() # 결측치(NA) 제거
print(PM_measured)## dayYear PM10_indoor PM2.5_indoor PM10_outdoor PM2.5_outdoor PURI
## 0 729.001389 6.0 6.0 54.20 45.10 0
## 1 729.004861 6.1 6.1 53.50 44.20 0
## 2 729.008333 6.1 6.1 52.80 44.00 0
## 3 729.011806 6.0 6.0 53.70 43.80 0
## 4 729.015278 6.0 6.0 53.20 43.90 0
## ... ... ... ... ... ... ...
## 6235 750.700694 1.2 1.1 14.42 6.94 1
## 6236 750.704167 1.2 1.1 15.00 7.08 1
## 6237 750.707639 1.2 1.2 14.72 7.08 1
## 6238 750.711111 1.1 1.1 14.70 7.00 1
## 6239 750.714583 1.1 1.0 15.50 7.06 1
##
## [6240 rows x 6 columns]
# 데이터 개형 확인하기
pd.options.plotting.backend = "plotly"
pio.templates.default = "presentation"
PM_fig = px.line(PM_measured, x = "dayYear", y = ["PM10_outdoor", "PM10_indoor", "PM2.5_outdoor", "PM2.5_indoor"], line_group = "PURI", title = "Historical PM Data", labels={"dayYear": "Day", "value": "Concentration (ug/m3)"})
PM_fig = PM_fig.update_layout(
font_family="Times New Roman",
font_color="black",
font_size = 18,
title_font_family="Times New Roman",
title_font_color="black",
title_font_size = 24,
legend_title_font_color="black"
)
PM_fig.write_html("Fig1.html")# 데이터 박스플롯 확인하기
pio.templates.default = "presentation"
fig = px.box(PM_measured, y = ["PM10_outdoor", "PM2.5_outdoor", "PM10_indoor", "PM2.5_indoor"], color = "PURI", labels={"variable": "Size", "value": "Concentration (ug/m3)"})
fig = fig.update_layout(
font_family="Times New Roman",
font_color="black",
font_size = 18,
title_font_family="Times New Roman",
title_font_color="black",
title_font_size = 24,
legend_title_font_color="black"
)
fig.write_html("Fig2.html")(b) 각 평균의 신뢰구간을 고려하여 실내외의 먼지 농도가 통계적으로 유의미하게 차이나는지 분석
- 공기청정기 가동여부에 따라 그룹화한 후, 평균과 불편분산 확인
PM_grouped = PM_measured.groupby("PURI")
PM_grouped.size()## PURI
## 0 3867
## 1 2373
## dtype: int64
PM_mean = PM_grouped.mean()
PM_var = PM_grouped.var(ddof = 1)
print(PM_mean); print(PM_var)## dayYear PM10_indoor PM2.5_indoor PM10_outdoor PM2.5_outdoor
## PURI
## 0 735.730384 6.349909 6.127592 18.362054 10.284668
## 1 746.586644 2.134471 2.002107 36.415659 19.585485
## dayYear PM10_indoor PM2.5_indoor PM10_outdoor PM2.5_outdoor
## PURI
## 0 15.101897 5.824322 5.420881 106.820093 61.182525
## 1 5.708430 1.374275 1.112786 527.551883 103.175443
- I/O Ratio에 대한 데이터로 나타낸 후, 평균과 표준편차 확인
# 데이터 변형
PM_off = PM_measured[PM_measured["PURI"] == 0]
PM_on = PM_measured[PM_measured["PURI"] == 1]
PM_off_io10 = PM_off["PM10_indoor"] / PM_off["PM10_outdoor"]
PM_off_io25 = PM_off["PM2.5_indoor"] / PM_off["PM2.5_outdoor"]
PM_on_io10 = PM_on["PM10_indoor"] / PM_on["PM10_outdoor"]
PM_on_io25 = PM_on["PM2.5_indoor"] / PM_on["PM2.5_outdoor"]
PM_off_ratio = pd.DataFrame({"PM10": PM_off_io10, "PM2.5": PM_off_io25})
PM_on_ratio = pd.DataFrame({"PM10": PM_on_io10, "PM2.5": PM_on_io25})# I/O Ratio의 평균
PM_off_mean = PM_off_ratio.mean()
PM_on_mean = PM_on_ratio.mean()
print(PM_off_mean); print(PM_on_mean)## PM10 0.412098
## PM2.5 0.740559
## dtype: float64
## PM10 0.068713
## PM2.5 0.111222
## dtype: float64
# I/O Ratio의 표준편차
PM_off_std = PM_off_ratio.std(ddof = 1)
PM_on_std = PM_on_ratio.std(ddof = 1)
print(PM_off_std); print(PM_on_std)## PM10 0.170310
## PM2.5 0.288033
## dtype: float64
## PM10 0.033774
## PM2.5 0.044983
## dtype: float64
평균의 신뢰구간 고려: t-검정 이용 (유의수준 = 0.05)
귀무가설: I/O Ratio의 평균이 1이다. (실내외 먼지 농도가 통계적으로 유의미하게 차이나지 않는다)
대립가설: I/O Ratio의 평균이 1이 아니다. (실내외 먼지 농도가 통계적으로 유의미한 차이를 보인다)
# I/O Ratio 데이터의 길이와 표준오차 구하기
size = np.array([len(PM_off_ratio), len(PM_on_ratio)])
df = size - 1
PM_off_se = PM_off_std / np.sqrt(size[0])
PM_on_se = PM_on_std / np.sqrt(size[1])# 공기청정기 가동 여부, 입자 크기에 따라 구분하여 t-test
PM_off_t_10 = stats.ttest_1samp(PM_off_ratio["PM10"], 1)
PM_off_t_25 = stats.ttest_1samp(PM_off_ratio["PM2.5"], 1)
PM_on_t_10 = stats.ttest_1samp(PM_on_ratio["PM10"], 1)
PM_on_t_25 = stats.ttest_1samp(PM_on_ratio["PM2.5"], 1)
matrix_tsample = [
['', 't-value', 'p-value'],
['PM10_off', PM_off_t_10[0], PM_off_t_10[1]],
['PM2.5_off', PM_off_t_25[0], PM_off_t_10[1]],
['PM10_on', PM_on_t_10[0], PM_on_t_10[1]],
['PM2.5_on', PM_on_t_25[0], PM_on_t_10[1]]
]
PM_t_table = ff.create_table(matrix_tsample, index=True)
PM_t_table.write_html("Table1.html")# 평균의 t-분포에 대한 95% 신뢰구간과 t-값 비교
confint1 = stats.t.interval(alpha = 0.95, df = df[0], loc = PM_off_mean[0], scale = PM_off_se[0])
confint2 = stats.t.interval(alpha = 0.95, df = df[0], loc = PM_off_mean[1], scale = PM_off_se[1])
confint3 = stats.t.interval(alpha = 0.95, df = df[1], loc = PM_on_mean[0], scale = PM_on_se[0])
confint4 = stats.t.interval(alpha = 0.95, df = df[1], loc = PM_on_mean[1], scale = PM_on_se[1])
matrix_conf = [
['Confint Table', 't-value', "Lower", "Upper"],
['PM10_off', PM_off_t_10[0], confint1[0], confint1[1]],
['PM2.5_off', PM_off_t_25[0], confint2[0], confint2[1]],
['PM10_on', PM_on_t_10[0], confint3[0], confint3[1]],
['PM2.5_off', PM_on_t_25[0], confint4[0], confint4[1]]
]
conf_table = ff.create_table(matrix_conf, index = True)
conf_table.write_html("Table2.html")(c) 두 비의 차이를 이용해 공기청정기 효과를 통계적으로 분석
- 점추정 필요: 두 비의 평균값의 차이와 표준오차 확인
# PM10 평균 비교
PM_off_mean["PM10"] - PM_on_mean["PM10"]## 0.3433858682283977
# PM10 표준오차
print(PM_off_se["PM10"]); print(PM_on_se["PM10"])## 0.002738753954937508
## 0.0006933256088557901
# PM2.5 평균 비교
PM_off_mean["PM2.5"] - PM_on_mean["PM2.5"]## 0.6293367572119647
# PM2.5 표준오차
print(PM_off_se["PM2.5"]); print(PM_on_se["PM2.5"])## 0.004631861475851101
## 0.0009234151662967842
2. 주어진 시간에서의 I/O Ratio를 구하고 공기청정기 가동여부에 따라 비교
- 시간에 따른 I/O Ratio로 데이터프레임 생성 후 확인
# Dataframe 생성
PM_ratio = pd.DataFrame({"PM10":PM_measured["PM10_indoor"] / PM_measured["PM10_outdoor"], "PM2.5": PM_measured["PM2.5_indoor"] / PM_measured["PM2.5_outdoor"], "PURI": PM_measured["PURI"]})
PM_ratio["dayYear"] = PM_measured["dayYear"]# 입자 크기별로 나누어 I/O Ratio를 나타낸 그래프
PM_fig2 = px.line(PM_ratio, x = "dayYear", y = ["PM10", "PM2.5"], line_group = "PURI", title = "Historical PM Ratio", labels={"dayYear": "Day", "value": "I/O Ratio (Cin/Cout)", "variable": "Size"})
PM_fig2 = PM_fig2.update_layout(
font_family="Times New Roman",
font_color="black",
font_size = 18,
title_font_family="Times New Roman",
title_font_color="black",
title_font_size = 24,
legend_title_font_color="black"
)
PM_fig2.write_html("Fig3.html")# Box plot 확인
fig2 = px.box(PM_ratio, y = ["PM10", "PM2.5"], color = "PURI", labels={"variable": "Size", "value": "I/O Ratio (Cin/Cout)"})
fig2 = fig2.update_layout(
font_family="Times New Roman",
font_color="black",
font_size = 18,
title_font_family="Times New Roman",
title_font_color="black",
title_font_size = 24,
legend_title_font_color="black"
)
fig2.write_html("Fig4.html")두 비의 차로 나타내어 분석: t-검정 - 단측검정 (유의수준 = 0.05)
기준: I/O Ratio의 차이 vs “0”
귀무가설: 두 비의 차이가 같다.
대립가설: 두 비의 차이가 0보다 크다 or 작다.
# 두 데이터의 길이가 다름 -> 샘플링 필요 (무작위로 수행)
PM_ratio_off = PM_ratio[PM_ratio["PURI"] == 0]
PM_ratio_on = PM_ratio[PM_ratio["PURI"] == 1]
PM_ratio_off2 = PM_ratio_off.sample(len(PM_ratio_on), replace = False)
# index 초기화
PM_ratio_off2 = PM_ratio_off2.reset_index()
PM_ratio_on = PM_ratio_on.reset_index()# 공기 청정기를 껐을 때의 I/O Ratio - 켰을 때의 I/O Ratio
PM_ratio_sub = PM_ratio_off2 - PM_ratio_on
PM_ratio_size = len(PM_ratio_sub)
PM_ratio_df = PM_ratio_size - 1
PM_ratio_mean = PM_ratio_sub.mean()
PM_ratio_se = PM_ratio_sub.std(ddof = 1) / np.sqrt(PM_ratio_size)
print(PM_ratio_mean); print(PM_ratio_se)## index -3125.348925
## PM10 0.339883
## PM2.5 0.626490
## PURI -1.000000
## dayYear -10.874826
## dtype: float64
## index 26.735553
## PM10 0.003533
## PM2.5 0.006056
## PURI 0.000000
## dayYear 0.093105
## dtype: float64
# 95% 유의수준에서 신뢰구간 형성
confint_ratio = stats.t.interval(alpha = 0.95, df = PM_ratio_df, loc = PM_ratio_mean[1], scale = PM_ratio_se[1])
confint_ratio2 = stats.t.interval(alpha = 0.95, df = PM_ratio_df, loc = PM_ratio_mean[2], scale = PM_ratio_se[2])
# 두 비의 차이와 0을 비교하는 t-test
PM_ratio_ttest10 = stats.ttest_1samp(PM_ratio_sub["PM10"], 0)
PM_ratio_ttest25 = stats.ttest_1samp(PM_ratio_sub["PM2.5"], 0)
matrix_conf2 = [
['Confint Table', 't-value', "Lower", "Upper"],
['PM10', PM_ratio_ttest10[0], confint_ratio[0], confint_ratio[1]],
['PM2.5', PM_ratio_ttest25[0], confint_ratio2[0], confint_ratio2[1]]
]
# 결과 확인
conf_table2 = ff.create_table(matrix_conf2, index = True)
conf_table2.write_html("Table3.html")3. 회귀분석을 이용해서 외부 (초)미세먼지 농도에 따라 실내 미세먼지 농도가 어떻게 될 지를 통계적으로 분석
모델: \(\frac{dC_{in}}{dt} = ACH \cdot P \cdot C_{out} (t) - (ACH + k) \cdot C_{in} (t) + \dot{S}\)
재실인원, 문 개폐 여부에 대한 정보가 주어지지 않음
따라서 하루 단위로 나누어 입자 입경별로, 그리고 공기청정기 가동 여부에 따라 분석 진행
# 데이터 불러오기
data <- read.csv("office_data.csv")
data <- data[complete.cases(data),]ODE: Trapezoidal Method로 풀이
- \(C_{in,n+1}=C_{in,n} + \frac{h}{2}\cdot \left (f(t_n, C_{in,n}) + f(t_{n+1}, C_{in,n+1}) \right )\)
# Cin을 구하는 함수 만들기
pred_Cin <- function(parameters, t, Cin, Cout){
ACH_P <- parameters[[1]]
ACH_k <- parameters[[2]]
S <- parameters[[3]]
pred <- Cin[1]
for (i in seq(1, length(t) - 1)){
h <- t[i+1] - t[i]
pred[i+1] <- (pred[i] + h / 2 * (ACH_P * (Cout[i] + Cout[i+1]) - ACH_k * pred[i] + 2 * S)) / (1 + h / 2 * ACH_k)
}
return (data.frame(dayYear = t, PM_measured = Cin, PM_predicted = pred))
}# 최소제곱법을 활용하기 위해 모델 예측값과 실측값 사이의 잔차를 구하는 함수 만들기
resid <- function(parameters, t, Cin, Cout){
temp <- pred_Cin(parameters, t, Cin, Cout)
return (sum((temp$PM_measured - temp$PM_predicted)^2))
}# 공기청정기 가동 여부에 따라 데이터를 구분하고, 하루단위로 모드 설정
data_off <- data[which(data$PURI == 0),]
data_on <- data[which(data$PURI == 1),]
data_off$mode <- 0
for (i in seq(14, 1)){
data_off[which(data_off$dayYear <= 729 + i),]$mode <- i
}
data_on$mode <- 0
for (i in seq(9, 1)){
data_on[which(data_on$dayYear <= 742 + i),]$mode <- i
}- scipy.optimize.minimize 대신, R의 DEoptim 사용
# Optimization의 Bound 설정
library(DEoptim)
Lp <- c(0, 0, 0)
Up <- c(100, 100, 100)# PM10, PURI = 0
param1 <- NULL
temp <- NULL
for (i in seq(1, 14)){
temp <- data_off[which(data_off$mode == i),]
opt <- DEoptim(resid, Lp, Up, control = DEoptim.control(trace = FALSE, NP = 100, itermax = 500, parallelType = 1, parVar = c("resid", "pred_Cin")), t = temp$dayYear, Cin = temp$PM10_indoor, Cout = temp$PM10_outdoor)
param1 <- rbind(param1, unlist((opt$optim$bestmem)))
}
param1 <- as.data.frame(param1)
model_data1 <- NULL
for (i in seq(1, 14)){
model_data1 <- rbind(model_data1, pred_Cin(param1[i,], data_off$dayYear[which(data_off$mode == i)], data_off$PM10_indoor[which(data_off$mode == i)], data_off$PM10_outdoor[which(data_off$mode == i)]))
}# PM2.5, PURI = 0
param2 <- NULL
temp <- NULL
for (i in seq(1, 14)){
temp <- data_off[which(data_off$mode == i),]
opt <- DEoptim(resid, Lp, Up, control = DEoptim.control(trace = FALSE, NP = 100, itermax = 500, parallelType = 1, parVar = c("resid", "pred_Cin")), t = temp$dayYear, Cin = temp$PM2.5_indoor, Cout = temp$PM2.5_outdoor)
param2 <- rbind(param2, unlist((opt$optim$bestmem)))
}
param2 <- as.data.frame(param2)
model_data2 <- NULL
for (i in seq(1, 14)){
model_data2 <- rbind(model_data2, pred_Cin(param2[i,], data_off$dayYear[which(data_off$mode == i)], data_off$PM2.5_indoor[which(data_off$mode == i)], data_off$PM2.5_outdoor[which(data_off$mode == i)]))
}# PM10, PURI = 1
param3 <- NULL
temp <- NULL
for (i in seq(1, 9)){
temp <- data_on[which(data_on$mode == i),]
opt <- DEoptim(resid, Lp, Up, control = DEoptim.control(trace = FALSE, NP = 100, itermax = 500, parallelType = 1, parVar = c("resid", "pred_Cin")), t = temp$dayYear, Cin = temp$PM10_indoor, Cout = temp$PM10_outdoor)
param3 <- rbind(param3, unlist((opt$optim$bestmem)))
}
param3 <- as.data.frame(param3)
model_data3 <- NULL
for (i in seq(1, 9)){
model_data3 <- rbind(model_data3, pred_Cin(param3[i,], data_on$dayYear[which(data_on$mode == i)], data_on$PM10_indoor[which(data_on$mode == i)], data_on$PM10_outdoor[which(data_on$mode == i)]))
}# PM2.5, PURI = 1
param4 <- NULL
temp <- NULL
for (i in seq(1, 9)){
temp <- data_on[which(data_on$mode == i),]
opt <- DEoptim(resid, Lp, Up, control = DEoptim.control(trace = FALSE, NP = 100, itermax = 500, parallelType = 1, parVar = c("resid", "pred_Cin")), t = temp$dayYear, Cin = temp$PM2.5_indoor, Cout = temp$PM2.5_outdoor)
param4 <- rbind(param4, unlist((opt$optim$bestmem)))
}
param4 <- as.data.frame(param4)
model_data4 <- NULL
for (i in seq(1, 9)){
model_data4 <- rbind(model_data4, pred_Cin(param4[i,], data_on$dayYear[which(data_on$mode == i)], data_on$PM2.5_indoor[which(data_on$mode == i)], data_on$PM2.5_outdoor[which(data_on$mode == i)]))
}- 결정된 모델 계수
param <- rbind(param1, param2, param3, param4)
names(param) <- c("ACH_P", "ACH_k", "S")
param## ACH_P ACH_k S
## 1 9.328329e-17 2.564634e-15 6.699289e+00
## 2 2.179061e-15 9.809872e-01 1.813316e-16
## 3 2.567834e+00 8.179470e+00 4.007757e-15
## 4 6.603540e+00 1.978024e+01 3.406573e-14
## 5 1.011317e+00 4.738561e+00 5.405171e-15
## 6 1.375071e-16 1.102722e+01 6.004932e+01
## 7 3.957095e+00 2.210519e+01 1.000000e+02
## 8 5.788817e+00 1.893160e+01 4.836396e+01
## 9 6.583253e+00 1.194040e+01 9.574926e-15
## 10 2.503989e+00 1.947362e+01 6.861569e+01
## 11 6.772222e+00 2.306236e+01 5.223411e+01
## 12 9.571093e-16 5.273472e+00 3.170811e+01
## 13 2.246358e-01 1.536079e+01 1.000000e+02
## 14 1.248887e-14 6.833919e+00 6.324004e+01
## 15 3.632864e-16 7.957497e-16 5.729020e+00
## 16 1.274366e+00 2.347041e+00 2.040001e-14
## 17 1.348886e+01 3.070851e+01 4.558789e+01
## 18 2.364825e+00 4.916577e+00 1.841579e-14
## 19 2.963524e+00 5.344704e+00 6.342729e-16
## 20 7.129675e+00 1.393974e+01 3.798708e-12
## 21 3.001183e+00 4.344621e+00 1.202283e-13
## 22 9.811753e+00 2.704086e+01 1.000000e+02
## 23 3.378944e+01 3.608690e+01 1.080628e-13
## 24 5.590697e+00 1.089545e+01 2.199322e+01
## 25 1.118405e+01 1.159817e+01 2.767587e-13
## 26 5.455330e-15 1.012792e+01 5.848473e+01
## 27 8.791260e-01 1.624850e+01 1.000000e+02
## 28 4.681856e-16 7.061785e+00 6.496993e+01
## 29 6.864763e+00 1.000000e+02 4.201423e+01
## 30 1.241156e+00 1.453465e+01 1.164515e-14
## 31 7.350835e-17 3.736645e+01 1.000000e+02
## 32 1.190448e+00 8.244459e+01 1.000000e+02
## 33 4.229658e-02 1.463209e-13 5.144498e-14
## 34 1.045657e+00 1.335381e+01 8.275653e+00
## 35 3.190458e-01 4.589139e+00 2.476456e-15
## 36 1.025227e+00 1.374493e+01 2.473571e-15
## 37 1.410475e+00 2.202637e+01 9.950422e-16
## 38 9.600310e+00 1.000000e+02 1.747207e+01
## 39 1.590195e+00 1.588400e+01 2.970043e-14
## 40 1.053695e-15 3.447842e+01 8.636807e+01
## 41 2.249060e+00 4.037713e+01 2.536631e+01
## 42 1.505802e+00 1.617964e+01 9.093305e-15
## 43 1.551116e+00 1.244401e+01 3.340662e+00
## 44 5.871379e-01 5.933324e+00 6.976404e-15
## 45 1.015290e+00 8.685282e+00 2.216446e-15
## 46 3.438771e+00 2.440936e+01 1.979982e-15
- 결정한 모델로부터 예측 정확도 평가: 결정계수 (\(r^2\))
getRSQ <- function(Exp, Pred){
rmd=which(is.na(Pred))
if(length(rmd)>0){
Exp=Exp[-rmd]
Pred=Pred[-rmd]
}
St <- sum((Exp-mean(Exp))^2)
S <- sum((Exp-Pred)^2)
return((St-S)/St)
}
rsq1 <- getRSQ(model_data1$PM_measured, model_data1$PM_predicted) # PM10, off
rsq2 <- getRSQ(model_data2$PM_measured, model_data2$PM_predicted) # PM2.5, off
rsq3 <- getRSQ(model_data3$PM_measured, model_data3$PM_predicted) # PM10, on
rsq4 <- getRSQ(model_data4$PM_measured, model_data4$PM_predicted) # PM2.5, on
rsq <- data.frame("PM10_off" = rsq1, "PM2.5_off" = rsq2, "PM10_on" = rsq3, "PM2.5_on" = rsq4)
rsq## PM10_off PM2.5_off PM10_on PM2.5_on
## 1 0.6899945 0.6935907 0.7838796 0.9074512
- 모델 피팅 결과와 측정값 비교
# 모델 예측값들로 데이터프레임 구성
model_data <- rbind(cbind(model_data1, model_data2), cbind(model_data3, model_data4))
names(model_data) <- c("dayYear", "PM10_measured", "PM10_predicted", "dayYear2", "PM2.5_measured", "PM2.5_predicted")
model_data$PURI <- data$PURI# 그래프 그리기
library(plotly)
fig <- plot_ly(model_data, x = ~dayYear, y = ~PM10_measured, type = "scatter", mode = "lines", name = "PM10_measured", line = list(width = 3, dash = "dot")) %>%
add_trace(y = ~PM10_predicted, name = "PM10_predicted", line = list(width = 3, dash = "dash")) %>%
add_trace(y = ~PM2.5_predicted, name = "PM2.5_predicted", line = list(width = 3)) %>%
add_trace(y = ~PM2.5_measured, name = "PM2.5_measured", line = list(width = 3, dash = "dash")) %>%
layout(title = "Comparison of Measured and Predicted Data",
xaxis = list(title = "Day", zeroline = TRUE),
yaxis = list(title = "Concentration (ug/m3)"))
fig